Introduction

Collecting the Data

Import pandas and numpy

import pandas as pd
import numpy as np

Let’s take a look at the files in our input directory, using the convenient shell commands in IPython.

!ls data
311-service-requests.csv
All-Web-Site-Data-Audience-Overview.xlsx
Aussie_Wines_Plotting.csv
March-2017-forecast-article.xlsx
Traffic_20170306-20170519.xlsx
Tweets.csv
airbnb_session_data.txt
customer-status.xlsx
debtors.csv
excel_sample.xlsx
general-ledger-sample.xlsx
mn-budget-detail-2014.csv
pnl.xlsx
population.xlsx
sales-estimate.xlsx
sales-feb-2014.xlsx
sales-jan-2014.xlsx
sales-mar-2014.xlsx
sales_data_types.csv
sales_transactions.xlsx
salesfunnel.xlsx
sample-sales-reps.xlsx
sample-sales-tax.csv
sample-salesv3.xlsx
tb_import.xlsx
~$general-ledger-sample.xlsx

There are a lot of files, but we only want to look at the sales .xlsx files.

!ls data/sales-*-2014.xlsx
data/sales-feb-2014.xlsx data/sales-jan-2014.xlsx data/sales-mar-2014.xlsx

Use the Python glob module to easily list out the files we need.

import glob
glob.glob("data/sales-*-2014.xlsx")
['data/sales-feb-2014.xlsx',
 'data/sales-jan-2014.xlsx',
 'data/sales-mar-2014.xlsx']

This gives us what we need; let’s import each of our files and combine them into one DataFrame.

pandas’ concat and append can both do this for us. I’m going to use append in this example.

The code snippet below initializes an empty DataFrame, then appends each of the individual files to the all_data DataFrame.

all_data = pd.DataFrame()
for f in glob.glob("data/sales-*-2014.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)
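
A note on versions: DataFrame.append was later deprecated and removed in pandas 2.0, so on newer releases a minimal equivalent built on pd.concat looks like this:

# build the list of frames first, then concatenate once
frames = [pd.read_excel(f) for f in glob.glob("data/sales-*-2014.xlsx")]
all_data = pd.concat(frames, ignore_index=True)

Concatenating once at the end is also faster than appending in a loop, since append copies the full DataFrame on every iteration.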

Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.

all_data.shape
(384, 7)
all_data.describe()
       account number    quantity  unit price    ext price
count      384.000000  384.000000  384.000000   384.000000
mean    478125.989583   24.372396   56.651406  1394.517344
std     220902.947401   14.373219   27.075883  1117.809743
min     141962.000000   -1.000000   10.210000   -97.160000
25%     257198.000000   12.000000   32.612500   482.745000
50%     424914.000000   23.500000   58.160000  1098.710000
75%     714466.000000   37.000000   80.965000  2132.260000
max     786968.000000   49.000000   99.730000  4590.810000

A lot of these statistics may not make much sense for this data set, but I’m most interested in the count row, to make sure the number of data elements is what we expect.

all_data.head()
account number name sku quantity unit price ext price date
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20

It is not critical in this example, but best practice is to convert the date column to a datetime object.

all_data['date'] = pd.to_datetime(all_data['date'])
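
Depending on your pandas version, you can often ask read_excel to do this at load time instead; for example:

# parse the date column while reading, rather than converting afterwards
jan = pd.read_excel("data/sales-jan-2014.xlsx", parse_dates=["date"])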

Combining Data

Now that we have all of the data in one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company’s customer segmentation strategy or some other mechanism for identifying their customers.

First, we read in the data.

status = pd.read_excel("data/customer-status.xlsx")
status
account number name status
0 740150 Barton LLC gold
1 714466 Trantow-Barrows silver
2 218895 Kulas Inc bronze
3 307599 Kassulke, Ondricka and Metz bronze
4 412290 Jerde-Hilpert bronze
5 729833 Koepp Ltd silver
6 146832 Kiehn-Spinka silver
7 688981 Keeling LLC silver
8 786968 Frami, Hills and Schmidt silver
9 239344 Stokes LLC gold
10 672390 Kuhn-Gusikowski silver
11 141962 Herman LLC gold
12 424914 White-Trantow silver
13 527099 Sanford and Sons bronze
14 642753 Pollich LLC bronze
15 257198 Cronin, Oberbrunner and Spencer gold

We want to merge this data with our concatenated data set of sales. We use pandas’ merge function and tell it to do a left join, which is similar to Excel’s VLOOKUP function.

all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 NaN
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
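
By default, merge joins on every column name the two DataFrames share, which here is account number and name. If you want to be explicit (a good habit when column names might overlap accidentally), you can spell the keys out:

# equivalent to the call above, with the join keys made explicit
all_data_st = pd.merge(all_data, status, how='left', on=["account number", "name"])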

This looks pretty good but let’s look at a specific account.

all_data_st[all_data_st["account number"]==737550].head()
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 NaN
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 NaN
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 NaN
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 NaN
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 NaN

This account number was not in our status file, so we have a bunch of NaNs. We can decide how we want to handle this situation. For this specific case, let’s label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.

all_data_st['status'].fillna('bronze',inplace=True)
all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver
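
As an aside, fillna with inplace=True on a single column can trigger chained-assignment warnings on newer pandas releases, and under copy-on-write it may not modify the DataFrame at all. Plain assignment does the same thing on every version:

# assignment form of the fillna above
all_data_st['status'] = all_data_st['status'].fillna('bronze')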

Check the data just to make sure we’re all good.

all_data_st[all_data_st["account number"]==737550].head()
account number name sku quantity unit price ext price date status
15 737550 Fritsch, Russel and Anderson S1-47412 40 51.01 2040.40 2014-02-05 01:20:40 bronze
25 737550 Fritsch, Russel and Anderson S1-06532 34 18.69 635.46 2014-02-07 09:22:02 bronze
66 737550 Fritsch, Russel and Anderson S1-27722 15 70.23 1053.45 2014-02-16 18:24:42 bronze
78 737550 Fritsch, Russel and Anderson S2-34077 26 93.35 2427.10 2014-02-20 18:45:43 bronze
80 737550 Fritsch, Russel and Anderson S1-93683 31 10.52 326.12 2014-02-21 13:55:45 bronze

Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability.

Using Categories

One of the relatively new capabilities in pandas is support for categorical data. From the pandas documentation:

“Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.”

For our purposes, the status field is a good candidate for a category type.

You must make sure you have a recent version of pandas installed for this example to work.

pd.__version__
'0.22.0'

First, we typecast it to a category using astype.

all_data_st["status"] = all_data_st["status"].astype("category")

This doesn’t appear to change anything yet.

all_data_st.head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze
4 672390 Kuhn-Gusikowski S1-06532 48 55.82 2679.36 2014-02-02 03:45:20 silver

But you can see that it is a new data type.

all_data_st.dtypes
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
status                  category
dtype: object

Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically.

all_data_st.sort_values(by=["status"]).head()
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
196 218895 Kulas Inc S2-83881 41 78.27 3209.07 2014-01-20 09:37:58 bronze
197 383080 Will LLC B1-33364 26 90.19 2344.94 2014-01-20 09:39:59 bronze
198 604255 Halvorson, Crona and Champlin S2-11481 37 96.71 3578.27 2014-01-20 13:07:28 bronze
200 527099 Sanford and Sons B1-05914 18 64.32 1157.76 2014-01-20 21:40:58 bronze

We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.

all_data_st["status"].cat.set_categories(["gold", "silver", "bronze"], inplace=True)
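
If you are on a newer pandas release, set_categories no longer accepts inplace=True; an explicit CategoricalDtype is the supported way to get the same ordering:

from pandas.api.types import CategoricalDtype

# an explicit, ordered dtype with the same Olympic medal ordering
medal_order = CategoricalDtype(["gold", "silver", "bronze"], ordered=True)
all_data_st["status"] = all_data_st["status"].astype(medal_order)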

Now, we can sort it so that gold shows on top.

all_data_st.sort_values(by=["status"]).head()
account number name sku quantity unit price ext price date status
68 740150 Barton LLC B1-38851 17 81.22 1380.74 2014-02-17 17:12:16 gold
63 257198 Cronin, Oberbrunner and Spencer S1-27722 28 10.21 285.88 2014-02-15 17:27:44 gold
207 740150 Barton LLC B1-86481 20 30.41 608.20 2014-01-22 16:33:51 gold
61 740150 Barton LLC B1-20000 28 81.39 2278.92 2014-02-15 07:45:16 gold
60 239344 Stokes LLC S2-83881 30 43.00 1290.00 2014-02-15 02:13:23 gold
all_data_st["status"].describe()
count        384
unique         3
top       bronze
freq         172
Name: status, dtype: object

For instance, you may want to take a quick look at how your top-tier customers are performing compared to the bottom tier. Use groupby to get the average of the values.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].mean()
quantity unit price ext price
status
gold 24.375000 53.723889 1351.944583
silver 22.842857 57.272714 1320.032214
bronze 25.616279 57.371163 1472.965930

Of course, you can run multiple aggregation functions on the data to get really useful information.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].agg([np.sum, np.mean, np.std])
quantity unit price ext price
sum mean std sum mean std sum mean std
status
gold 1755 24.375000 14.575145 3868.12 53.723889 28.740080 97340.01 1351.944583 1182.657312
silver 3198 22.842857 14.512843 8018.18 57.272714 26.556242 184804.51 1320.032214 1086.384051
bronze 4406 25.616279 14.136071 9867.84 57.371163 26.857370 253350.14 1472.965930 1116.683843
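
On pandas 0.25 or later, named aggregation makes the output columns easier to read; a sketch of the same summary:

# name the output columns directly (pandas >= 0.25)
all_data_st.groupby("status").agg(
    total_units=("quantity", "sum"),
    avg_unit_price=("unit price", "mean"),
    total_ext_price=("ext price", "sum"),
)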

So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.

Maybe we should look at how many bronze customers we have and see what is going on.

What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.

I’m purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean.

all_data_st.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1,7]].groupby(["status"])["name"].count()
status
gold      4
silver    7
bronze    9
Name: name, dtype: int64
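
A slightly shorter way to get the same counts, since status is one value per account, is value_counts on the de-duplicated frame:

# one row per account, then count each tier
all_data_st.drop_duplicates(subset=["account number", "name"])["status"].value_counts()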

Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers.

all_data_st.head(4)
account number name sku quantity unit price ext price date status
0 383080 Will LLC B1-20000 7 33.69 235.83 2014-02-01 09:04:59 bronze
1 412290 Jerde-Hilpert S1-27722 11 21.12 232.32 2014-02-01 11:51:46 bronze
2 412290 Jerde-Hilpert B1-86481 3 35.99 107.97 2014-02-01 17:24:32 bronze
3 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70 2014-02-01 19:56:48 bronze

We will start over and import a clean file.

df = pd.read_excel("data/sample-sales-reps.xlsx")

## Set default commission of 3%
df["commission"] = .03
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.03
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.03
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.03
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.03
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.03

Since shirts are high margin, adjust all products in the shirt category to a commission rate of 5%.

df.loc[df["category"] == "Shirt", ["commission"]] = .05
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.03
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05

There is also a special program for selling 10 or more belts in a transaction: those lines earn a 4% commission.

df.loc[(df["category"] == "Belt") & (df["quantity"] >= 10), ["commission"]] = .04
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05

Finally, some transactions can get a bonus and a commission increase.

df["bonus"] = 0
df.loc[(df["category"] == "Shoes") & (df["ext price"] >= 1000 ), ["bonus", "commission"]] = 250, 0.045
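
If you prefer to express all of the rules in one place rather than as a sequence of .loc overwrites, numpy's select can consolidate them. A sketch using the same thresholds as above:

# each condition is checked in order; the first match wins
conditions = [
    df["category"] == "Shirt",                                # high-margin shirts
    (df["category"] == "Belt") & (df["quantity"] >= 10),      # belt volume program
    (df["category"] == "Shoes") & (df["ext price"] >= 1000),  # bonus-eligible shoes
]
df["commission"] = np.select(conditions, [0.05, 0.04, 0.045], default=0.03)
df["bonus"] = np.where(conditions[2], 250, 0)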

Calculate the compensation at the line item level

df["comp"] = df["commission"] * df["ext price"] + df["bonus"]
df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600

Calculate the commissions by sales rep

df.groupby(["sales rep"])["comp"].sum().round(2)


sales rep
Ansley Cummings       2699.69
Beth Skiles           3664.16
Esequiel Schinner    12841.28
Loring Predovic      13115.42
Shannen Hudson        6541.78
Teagan O'Keefe       10931.30
Trish Deckow          7641.91
Name: comp, dtype: float64
df["date"] = pd.to_datetime(df['date'])

df["month"] = df["date"].dt.month
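
One caveat before grouping: this data spans both 2015 and 2016, so the bare month number folds the two years together. If you need them kept apart, a monthly Period preserves the year:

# a year-aware alternative to dt.month
df.groupby([df["date"].dt.to_period("M"), "sales rep"])["comp"].sum().round(2).head()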

Now, what about on a monthly basis?

df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600 8
df.groupby(["month","sales rep"])["comp"].sum().round(2)
month  sales rep        
1      Ansley Cummings       157.10
       Beth Skiles            78.09
       Esequiel Schinner    1481.78
       Loring Predovic       655.96
       Shannen Hudson        319.13
       Teagan O'Keefe        732.10
       Trish Deckow          305.58
2      Ansley Cummings       347.06
       Beth Skiles          1262.36
       Esequiel Schinner     741.87
       Loring Predovic      1794.22
       Shannen Hudson        524.58
       Teagan O'Keefe        893.54
       Trish Deckow          430.81
3      Ansley Cummings       362.13
       Beth Skiles           439.53
       Esequiel Schinner    1323.87
       Loring Predovic      1183.59
       Shannen Hudson        474.90
       Teagan O'Keefe       1064.76
       Trish Deckow          796.12
4      Ansley Cummings       123.27
       Beth Skiles           121.12
       Esequiel Schinner    1478.78
       Loring Predovic       907.41
       Shannen Hudson        514.77
       Teagan O'Keefe        593.64
       Trish Deckow          347.04
5      Ansley Cummings       101.29
       Beth Skiles           347.73
                             ...   
8      Teagan O'Keefe        477.33
       Trish Deckow         1493.39
9      Ansley Cummings       109.70
       Beth Skiles           145.43
       Esequiel Schinner     178.20
       Loring Predovic       978.46
       Shannen Hudson        765.28
       Teagan O'Keefe       1086.80
       Trish Deckow          435.42
10     Ansley Cummings       510.49
       Beth Skiles           204.40
       Esequiel Schinner     994.76
       Loring Predovic      1954.61
       Shannen Hudson        990.48
       Teagan O'Keefe       1111.47
       Trish Deckow          625.12
11     Ansley Cummings        80.41
       Beth Skiles           464.13
       Esequiel Schinner     928.05
       Loring Predovic      1253.52
       Shannen Hudson        567.31
       Teagan O'Keefe        554.03
       Trish Deckow          281.69
12     Ansley Cummings       288.82
       Beth Skiles           131.40
       Esequiel Schinner    1235.69
       Loring Predovic       660.44
       Shannen Hudson        421.24
       Teagan O'Keefe        421.16
       Trish Deckow          848.57
Name: comp, Length: 84, dtype: float64
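
That long Series is easier to scan if you pivot the rep level into columns with unstack:

# months as rows, reps as columns
df.groupby(["month", "sales rep"])["comp"].sum().round(2).unstack("sales rep")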

You can also do this for the entire numeric DataFrame, without creating a month variable, by resampling on the date index as follows.

df.set_index('date').groupby('sales rep').resample("M").sum().head(20)
account number quantity unit price ext price commission bonus comp month
sales rep date
Ansley Cummings 2015-06-30 3593984 59 196.43 2214.13 0.165 250 342.04250 24
2015-07-31 3593984 43 152.77 1460.69 0.160 0 58.43250 28
2015-08-31 7187968 77 437.26 4080.37 0.340 0 171.32280 64
2015-09-30 6289472 64 398.20 2691.38 0.280 0 109.69920 63
2015-10-31 6289472 86 437.41 5803.84 0.315 250 510.48870 70
2015-11-30 3593984 25 259.40 1715.97 0.160 0 80.40890 44
2015-12-31 9883456 139 465.39 6820.11 0.450 0 288.81630 132
2016-01-31 7187968 85 413.52 4346.59 0.310 0 157.10410 8
2016-02-29 4492480 56 233.69 2561.57 0.185 250 347.06005 10
2016-03-31 3593984 43 260.11 2553.24 0.175 250 362.13085 12
2016-04-30 4492480 55 276.44 2855.68 0.200 0 123.26960 20
2016-05-31 3593984 36 282.24 2151.62 0.190 0 101.28800 20
2016-06-30 3593984 15 257.26 980.25 0.180 0 47.62630 24
Beth Skiles 2015-06-30 846366 37 161.44 1763.43 0.150 0 88.17150 18
2015-07-31 1692732 64 235.55 2468.03 0.190 0 81.93400 42
2015-08-31 3385464 123 699.43 6275.63 0.470 0 283.95210 96
2015-09-30 1692732 68 326.76 3374.01 0.240 0 145.42650 54
2015-10-31 3103342 106 591.17 5248.15 0.410 0 204.40310 110
2015-11-30 1974854 78 412.76 4496.55 0.315 250 464.13370 77
2015-12-31 2539098 55 480.51 2970.65 0.360 0 131.39730 108
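
Note that sum here is applied to every numeric column, so totals like the summed account number are meaningless. Selecting the columns of interest first keeps the output sensible:

# restrict the resample to the columns where a sum means something
df.set_index('date').groupby('sales rep')[["ext price", "comp"]].resample("M").sum().head()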

What if you want weekly totals, with the weeks ending on Monday?

df.set_index('date').groupby('sales rep').resample("W-Mon").sum().head(20)
account number quantity unit price ext price commission bonus comp month
sales rep date
Ansley Cummings 2015-06-15 1796992 39 72.79 1436.45 0.075 250 309.1255 12
2015-06-22 898496 18 33.15 596.70 0.040 0 23.8680 6
2015-06-29 898496 2 90.49 180.98 0.050 0 9.0490 6
2015-07-06 898496 11 22.99 252.89 0.050 0 12.6445 7
2015-07-13 898496 7 83.34 583.38 0.030 0 17.5014 7
2015-07-20 898496 17 28.10 477.70 0.050 0 23.8850 7
2015-07-27 898496 8 18.34 146.72 0.030 0 4.4016 7
2015-08-03 0 0 0.00 0.00 0.000 0 0.0000 0
2015-08-10 2695488 36 112.58 1191.71 0.120 0 50.4873 24
2015-08-17 0 0 0.00 0.00 0.000 0 0.0000 0
2015-08-24 2695488 30 170.84 1964.79 0.140 0 89.4372 24
2015-08-31 1796992 11 153.84 923.87 0.080 0 31.3983 16
2015-09-07 0 0 0.00 0.00 0.000 0 0.0000 0
2015-09-14 2695488 47 89.22 1432.40 0.120 0 59.1218 27
2015-09-21 1796992 8 162.20 640.40 0.100 0 32.0200 18
2015-09-28 0 0 0.00 0.00 0.000 0 0.0000 0
2015-10-05 1796992 9 146.78 618.58 0.060 0 18.5574 18
2015-10-12 3593984 42 193.83 2216.34 0.170 0 87.6666 40
2015-10-19 0 0 0.00 0.00 0.000 0 0.0000 0
2015-10-26 2695488 44 243.58 3587.50 0.145 250 422.8221 30

To group by a specific period, use pd.Grouper; for example, annual totals (years ending in December): df.groupby(['customer name', 'sku', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
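
Swapping the frequency gives other windows; for example, monthly totals per customer:

# calendar-month buckets via pd.Grouper
df.groupby(['customer name', pd.Grouper(key='date', freq='M')])['ext price'].sum().head()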

Here is an aggregation function that is sometimes useful.

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
ext price quantity unit price
mean 571.75346 10.411333 55.316493
sum 857630.19000 15617.000000 NaN

You can create custom functions

get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent" # required for row label

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
ext price quantity unit price sku
mean 571.75346 10.411333 55.316493 NaN
most frequent NaN NaN NaN TL-23025
sum 857630.19000 15617.000000 NaN NaN

If you want to control the order of the output rows and columns, use an ordered dictionary.

import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext price quantity sku
mean 571.75346 10.411333 NaN
most frequent NaN NaN TL-23025
sum 857630.19000 15617.000000 NaN
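
On Python 3.7 and later, plain dictionaries preserve insertion order, so the OrderedDict is optional there:

# a plain dict keeps its insertion order on Python 3.7+
f = {'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'sku': [get_max]}
df.agg(f)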

You can of course use any of the available extensions, like qgrid below, for further analysis and filtering.

import qgrid
from IPython.display import display

qgrid_widget = qgrid.show_grid(df, show_toolbar=True)
## Not sure why this is not working - giving it a skip for now
qgrid_widget
QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
qgrid_widget.get_changed_df()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.040 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.050 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.050 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.050 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.050 0 18.1600 8
5 282122 Connelly, Abshire and Von Beth Skiles GJ-90272 Shoes 20 96.62 1932.40 2016-03-17 10:19:05 0.045 250 336.9580 3
6 398620 Brekke Ltd Esequiel Schinner DU-87462 Shirt 10 67.64 676.40 2015-11-25 22:05:36 0.050 0 33.8200 11
7 218667 Jaskolski-O'Hara Trish Deckow DU-87462 Shirt 11 91.86 1010.46 2016-04-24 15:05:58 0.050 0 50.5230 4
8 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 18 98.67 1776.06 2015-08-06 08:09:56 0.050 0 88.8030 8
9 14406 Harber, Lubowitz and Fahey Esequiel Schinner GP-14407 Belt 12 64.48 773.76 2016-01-08 09:52:04 0.040 0 30.9504 1
10 887145 Gislason LLC Loring Predovic NZ-99565 Shirt 20 92.87 1857.40 2016-05-22 06:09:58 0.050 0 92.8700 5
11 93583 Hegmann-Howell Esequiel Schinner HZ-54995 Belt 17 40.56 689.52 2015-07-05 01:05:52 0.040 0 27.5808 7
12 280749 Douglas PLC Teagan O'Keefe TK-29646 Shoes 17 42.61 724.37 2016-05-16 19:11:55 0.030 0 21.7311 5
13 453258 Runolfsson-Bayer Shannen Hudson FI-01804 Shirt 20 12.31 246.20 2015-08-24 21:32:26 0.050 0 12.3100 8
14 453258 Runolfsson-Bayer Shannen Hudson EO-54210 Shirt 20 67.95 1359.00 2015-11-24 15:04:30 0.050 0 67.9500 11
15 280749 Douglas PLC Teagan O'Keefe GJ-90272 Shoes 17 23.20 394.40 2015-09-12 13:16:56 0.030 0 11.8320 9
16 453258 Runolfsson-Bayer Shannen Hudson NZ-99565 Shirt 2 30.23 60.46 2016-03-21 11:31:24 0.050 0 3.0230 3
17 280749 Douglas PLC Teagan O'Keefe GP-14407 Belt 3 22.64 67.92 2015-12-02 01:55:28 0.030 0 2.0376 12
18 398620 Brekke Ltd Esequiel Schinner HZ-54995 Belt 1 36.20 36.20 2015-12-07 18:45:54 0.030 0 1.0860 12
19 251881 Zulauf-Grady Teagan O'Keefe DU-87462 Shirt 9 82.06 738.54 2016-02-23 17:41:41 0.050 0 36.9270 2
20 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe DU-87462 Shirt 5 28.51 142.55 2016-03-01 19:34:33 0.050 0 7.1275 3
21 887145 Gislason LLC Loring Predovic ZY-38455 Shirt 15 95.60 1434.00 2015-10-08 03:01:22 0.050 0 71.7000 10
22 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TK-29646 Shoes 11 30.80 338.80 2015-11-25 06:07:47 0.030 0 10.1640 11
23 282122 Connelly, Abshire and Von Beth Skiles HZ-54995 Belt 6 64.98 389.88 2016-01-15 21:07:30 0.030 0 11.6964 1
24 758030 Kilback-Abernathy Trish Deckow TK-29646 Shoes 19 26.90 511.10 2015-09-28 14:33:34 0.030 0 15.3330 9
25 280749 Douglas PLC Teagan O'Keefe NZ-99565 Shirt 15 30.34 455.10 2016-03-29 21:19:08 0.050 0 22.7550 3
26 680916 Mueller and Sons Loring Predovic DU-87462 Shirt 2 61.01 122.02 2016-05-19 00:05:12 0.050 0 6.1010 5
27 898496 Weissnat, Veum and Barton Ansley Cummings TL-23025 Shoes 8 18.34 146.72 2015-07-24 17:48:46 0.030 0 4.4016 7
28 530925 Purdy and Sons Teagan O'Keefe DU-87462 Shirt 10 18.40 184.00 2016-03-05 23:45:57 0.050 0 9.2000 3
29 752312 Watsica-Pfannerstill Loring Predovic GP-14407 Belt 15 17.93 268.95 2016-03-18 07:17:35 0.040 0 10.7580 3
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1470 680916 Mueller and Sons Loring Predovic HZ-54995 Belt 18 17.93 322.74 2016-04-28 01:57:00 0.040 0 12.9096 4
1471 218667 Jaskolski-O'Hara Trish Deckow TK-29646 Shoes 13 57.80 751.40 2015-11-09 12:17:52 0.030 0 22.5420 11
1472 887145 Gislason LLC Loring Predovic GP-14407 Belt 5 51.82 259.10 2016-06-08 11:25:05 0.030 0 7.7730 6
1473 93583 Hegmann-Howell Esequiel Schinner ZY-38455 Shirt 10 86.25 862.50 2016-05-10 23:48:57 0.050 0 43.1250 5
1474 530925 Purdy and Sons Teagan O'Keefe FI-01804 Shirt 13 10.13 131.69 2016-04-01 20:39:41 0.050 0 6.5845 4
1475 752312 Watsica-Pfannerstill Loring Predovic HZ-54995 Belt 18 23.48 422.64 2015-07-25 05:51:10 0.040 0 16.9056 7
1476 758030 Kilback-Abernathy Trish Deckow TK-29646 Shoes 7 86.67 606.69 2015-08-19 15:35:31 0.030 0 18.2007 8
1477 398620 Brekke Ltd Esequiel Schinner GP-14407 Belt 7 16.52 115.64 2016-04-07 10:25:42 0.030 0 3.4692 4
1478 218667 Jaskolski-O'Hara Trish Deckow GJ-90272 Shoes 6 36.44 218.64 2015-11-02 20:55:11 0.030 0 6.5592 11
1479 14406 Harber, Lubowitz and Fahey Esequiel Schinner HZ-54995 Belt 5 48.52 242.60 2016-05-22 12:34:35 0.030 0 7.2780 5
1480 218667 Jaskolski-O'Hara Trish Deckow TL-23025 Shoes 14 88.33 1236.62 2015-07-12 15:59:56 0.045 250 305.6479 7
1481 530925 Purdy and Sons Teagan O'Keefe HZ-54995 Belt 9 62.85 565.65 2015-08-12 17:07:20 0.030 0 16.9695 8
1482 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe GP-14407 Belt 9 55.57 500.13 2016-02-27 03:20:13 0.030 0 15.0039 2
1483 758030 Kilback-Abernathy Trish Deckow GJ-90272 Shoes 8 98.87 790.96 2016-02-10 16:51:59 0.030 0 23.7288 2
1484 136521 Labadie and Sons Esequiel Schinner FI-01804 Shirt 15 99.43 1491.45 2015-12-17 21:58:31 0.050 0 74.5725 12
1485 898496 Weissnat, Veum and Barton Ansley Cummings GJ-90272 Shoes 18 11.05 198.90 2016-04-01 10:19:26 0.030 0 5.9670 4
1486 136521 Labadie and Sons Esequiel Schinner HZ-54995 Belt 20 54.67 1093.40 2016-02-20 10:43:41 0.040 0 43.7360 2
1487 218667 Jaskolski-O'Hara Trish Deckow TL-23025 Shoes 19 45.97 873.43 2015-08-12 22:53:38 0.030 0 26.2029 8
1488 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TL-23025 Shoes 1 55.81 55.81 2016-02-01 17:50:13 0.030 0 1.6743 2
1489 887145 Gislason LLC Loring Predovic ZY-38455 Shirt 10 16.37 163.70 2015-10-11 22:11:45 0.050 0 8.1850 10
1490 898496 Weissnat, Veum and Barton Ansley Cummings GP-14407 Belt 8 83.37 666.96 2016-01-27 03:26:53 0.030 0 20.0088 1
1491 62592 O'Keefe-Koch Shannen Hudson TK-29646 Shoes 18 24.92 448.56 2016-02-11 17:13:23 0.030 0 13.4568 2
1492 530925 Purdy and Sons Teagan O'Keefe HZ-54995 Belt 3 90.33 270.99 2015-09-23 07:36:34 0.030 0 8.1297 9
1493 93583 Hegmann-Howell Esequiel Schinner GP-14407 Belt 5 45.93 229.65 2016-04-13 22:34:56 0.030 0 6.8895 4
1494 14406 Harber, Lubowitz and Fahey Esequiel Schinner ZY-38455 Shirt 16 21.60 345.60 2015-11-18 06:28:56 0.050 0 17.2800 11
1495 136521 Labadie and Sons Esequiel Schinner GP-14407 Belt 4 98.57 394.28 2016-06-01 17:28:44 0.030 0 11.8284 6
1496 575704 Lindgren, Thompson and Kirlin Teagan O'Keefe TK-29646 Shoes 3 65.16 195.48 2016-04-02 16:38:31 0.030 0 5.8644 4
1497 898496 Weissnat, Veum and Barton Ansley Cummings EO-54210 Shirt 17 28.10 477.70 2015-07-20 19:30:10 0.050 0 23.8850 7
1498 62592 O'Keefe-Koch Shannen Hudson FI-01804 Shirt 19 94.96 1804.24 2015-10-05 15:55:01 0.050 0 90.2120 10
1499 530925 Purdy and Sons Teagan O'Keefe DU-87462 Shirt 3 22.86 68.58 2015-10-07 19:49:38 0.050 0 3.4290 10

1500 rows × 13 columns

df.head()
account number customer name sales rep sku category quantity unit price ext price date commission bonus comp month
0 680916 Mueller and Sons Loring Predovic GP-14407 Belt 19 88.49 1681.31 2015-11-17 05:58:34 0.04 0 67.2524 11
1 680916 Mueller and Sons Loring Predovic FI-01804 Shirt 3 78.07 234.21 2016-02-13 04:04:11 0.05 0 11.7105 2
2 530925 Purdy and Sons Teagan O'Keefe EO-54210 Shirt 19 30.21 573.99 2015-08-11 12:44:38 0.05 0 28.6995 8
3 14406 Harber, Lubowitz and Fahey Esequiel Schinner NZ-99565 Shirt 12 90.29 1083.48 2016-01-23 02:15:50 0.05 0 54.1740 1
4 398620 Brekke Ltd Esequiel Schinner NZ-99565 Shirt 5 72.64 363.20 2015-08-10 07:16:03 0.05 0 18.1600 8
df.groupby("category").agg({"quantity":["count","size"],"unit price":["sum"],"ext price":['mean']})
quantity unit price ext price
count size sum mean
category
Belt 298 298 15754.05 548.017852
Shirt 734 734 41696.16 581.485817
Shoes 468 468 25524.53 571.603162

What is nice about the pivot table is that you have three DataFrame parameters to adjust: index, columns, and values, whereas the groupby function only uses index and values.

df.pivot_table(index=["month","sales rep"],columns=["category"], values=["bonus","comp"],aggfunc=[np.sum, np.size],fill_value="No Sale").head(12)
sum size
bonus comp bonus comp
category Belt Shirt Shoes Belt Shirt Shoes Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 0 0 0 84.7025 34.094 38.3076 3 3 2 3 3 2
Beth Skiles 0 0 0 24.4992 35.6795 17.9136 3 2 1 3 2 1
Esequiel Schinner 0 0 750 120.595 348.077 1013.11 3 13 8 3 13 8
Loring Predovic 0 0 250 37.9589 252.68 365.321 6 8 4 6 8 4
Shannen Hudson 0 0 0 116.422 201.026 1.6809 3 6 1 3 6 1
Teagan O'Keefe 0 0 0 42.5652 660.813 28.7259 2 21 3 2 21 3
Trish Deckow 0 0 0 113.095 172.417 20.0688 7 8 3 7 8 3
2 Ansley Cummings No Sale 0 250 No Sale 10.0695 336.991 No Sale 1 4 No Sale 1 4
Beth Skiles No Sale 0 1000 No Sale 17.809 1244.55 No Sale 2 5 No Sale 2 5
Esequiel Schinner 0 0 250 153.831 174.106 413.933 6 8 9 6 8 9
Loring Predovic 0 0 1000 82.4416 366.779 1345 4 11 10 4 11 10
Shannen Hudson 0 0 250 38.8498 152.086 333.645 3 4 3 3 4 3

In the pivot table below, only certain calculations are applied to certain columns; it is more selective.

# this is without brackets, so the type of calculation is not displayed
#df.pivot_table(index=["month","sales rep"],columns=["category"], values=["comp","bonus"],aggfunc={"comp":np.sum, "bonus":np.size},fill_value="No Sale").head(12)
df.pivot_table(index=["month","sales rep"],aggfunc={"comp":[np.sum], "bonus":[np.size]},columns=["category"], values=["comp","bonus"],fill_value="No Sale").head(12)
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688
2 Ansley Cummings No Sale 1 4 No Sale 10.0695 336.991
Beth Skiles No Sale 2 5 No Sale 17.809 1244.55
Esequiel Schinner 6 8 9 153.831 174.106 413.933
Loring Predovic 4 11 10 82.4416 366.779 1345
Shannen Hudson 3 4 3 38.8498 152.086 333.645
df_pivot = df.pivot_table(index=["month","sales rep"],aggfunc={"comp":[np.sum], "bonus":[np.size]},columns=["category"], values=["comp","bonus"],fill_value="No Sale")

You can now, if you feel the need to, do some querying.

df_pivot.query("month == [1]")
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688

This is another way to do it; I find it more reliable.

df_pivot[df_pivot.index.get_level_values(0).isin([1])]
bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
Beth Skiles 3 2 1 24.4992 35.6795 17.9136
Esequiel Schinner 3 13 8 120.595 348.077 1013.11
Loring Predovic 6 8 4 37.9589 252.68 365.321
Shannen Hudson 3 6 1 116.422 201.026 1.6809
Teagan O'Keefe 2 21 3 42.5652 660.813 28.7259
Trish Deckow 7 8 3 113.095 172.417 20.0688
df_pivot[df_pivot.index.get_level_values(1).isin(["Ansley Cummings"])]

bonus comp
size sum
category Belt Shirt Shoes Belt Shirt Shoes
month sales rep
1 Ansley Cummings 3 3 2 84.7025 34.094 38.3076
2 Ansley Cummings No Sale 1 4 No Sale 10.0695 336.991
3 Ansley Cummings No Sale 2 2 No Sale 48.57 313.561
4 Ansley Cummings 1 2 2 50.112 62.678 10.4796
5 Ansley Cummings 1 3 No Sale 25.172 76.116 No Sale
6 Ansley Cummings 2 4 2 25.9473 54.596 309.125
7 Ansley Cummings No Sale 2 2 No Sale 36.5295 21.903
8 Ansley Cummings 2 4 2 51.198 90.2805 29.8443
9 Ansley Cummings 2 3 2 39.3054 49.317 21.0768
10 Ansley Cummings 1 4 2 20.1152 154.215 336.159
11 Ansley Cummings 2 2 No Sale 8.0844 72.3245 No Sale
12 Ansley Cummings 2 5 4 73.1176 164.834 50.8647
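
Another option for slicing a MultiIndex is xs, which takes the level name directly:

# all reps for month 1
df_pivot.xs(1, level="month")
# one rep across all months; drop_level=False keeps the month level visible
df_pivot.xs("Ansley Cummings", level="sales rep", drop_level=False)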